feat(middleware): Model routing, PII filtering, Cloud model proxies#9802
Open
richiejp wants to merge 2 commits into
Big-bang squash-friendly commit covering the work since master:
phases 1-7 of the cloud-proxy migration, tool-call support, plus
the surrounding routing / middleware / PII / billing scaffolding
this branch had been carrying.
Cloud-proxy backend (backend/go/cloud-proxy/):
* New gRPC backend with two modes:
  * Passthrough: the Forward RPC shovels raw HTTP between client and
    upstream so the wire format is preserved byte-for-byte.
  * Translate: PredictRich / PredictStreamRich convert the internal
    proto to OpenAI Chat Completions or Anthropic Messages,
    preserving tool calls + usage tokens through pb.Reply.
* API keys are resolved from api_key_env or api_key_file (mutually
  exclusive) and are never stored in YAML.
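A minimal sketch of the key-resolution rule described above. Only the option names `api_key_env` and `api_key_file` come from the commit message; the `KeySource` struct, the `ResolveAPIKey` function, and the error texts are hypothetical and are not taken from the branch.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// KeySource is a hypothetical config fragment. Only the option names
// api_key_env and api_key_file are taken from the commit message.
type KeySource struct {
	APIKeyEnv  string // name of an environment variable holding the key
	APIKeyFile string // path of a file holding the key
}

// ResolveAPIKey returns the key from exactly one configured source and
// rejects configurations that set both or neither.
func ResolveAPIKey(src KeySource) (string, error) {
	switch {
	case src.APIKeyEnv != "" && src.APIKeyFile != "":
		return "", errors.New("api_key_env and api_key_file are mutually exclusive")
	case src.APIKeyEnv != "":
		key, ok := os.LookupEnv(src.APIKeyEnv)
		if !ok || key == "" {
			return "", fmt.Errorf("environment variable %q is unset or empty", src.APIKeyEnv)
		}
		return key, nil
	case src.APIKeyFile != "":
		raw, err := os.ReadFile(src.APIKeyFile)
		if err != nil {
			return "", fmt.Errorf("reading api_key_file: %w", err)
		}
		// Trim the trailing newline that most editors append.
		key := strings.TrimSpace(string(raw))
		if key == "" {
			return "", errors.New("api_key_file is empty")
		}
		return key, nil
	default:
		return "", errors.New("no API key source configured")
	}
}

func main() {
	// EXAMPLE_UPSTREAM_KEY is an illustrative variable name.
	_ = os.Setenv("EXAMPLE_UPSTREAM_KEY", "sk-example")
	key, err := ResolveAPIKey(KeySource{APIKeyEnv: "EXAMPLE_UPSTREAM_KEY"})
	fmt.Println(len(key), err)
}
```

Because the YAML only ever names the variable or the file, the secret itself never appears in the model config.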
gRPC interface (pkg/grpc/):
* Forward bidi RPC added to Backend proto.
* AIModelRich optional extension interface returning *pb.Reply
  so backends can surface tool_calls and usage tokens.
* Fixed forwardClient.CloseSend prematurely closing the gRPC
  connection (caught by e2e tests). Cleanup now fires on stream
  end (Recv error/EOF) instead.
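The CloseSend fix can be illustrated with a small sketch. It does not use the real gRPC types: `stream`, `fakeStream`, `cleanupStream`, and `withCleanup` are hypothetical stand-ins, and the actual forwardClient on the branch may be structured differently. The point is only that CloseSend half-closes the send side, while resources are released once, when Recv first reports an error or EOF.

```go
package main

import (
	"fmt"
	"io"
	"sync"
)

// stream is a stand-in for the part of a bidi client stream used here.
type stream interface {
	CloseSend() error
	Recv() (string, error)
}

// fakeStream replays canned messages and then reports io.EOF.
type fakeStream struct{ msgs []string }

func (f *fakeStream) CloseSend() error { return nil }

func (f *fakeStream) Recv() (string, error) {
	if len(f.msgs) == 0 {
		return "", io.EOF
	}
	m := f.msgs[0]
	f.msgs = f.msgs[1:]
	return m, nil
}

// cleanupStream releases resources when the receive side ends,
// not when the caller finishes sending.
type cleanupStream struct {
	inner   stream
	once    sync.Once
	cleanup func()
}

func withCleanup(s stream, cleanup func()) *cleanupStream {
	return &cleanupStream{inner: s, cleanup: cleanup}
}

// CloseSend only half-closes: the peer may still have data to send.
func (c *cleanupStream) CloseSend() error { return c.inner.CloseSend() }

// Recv triggers cleanup exactly once on the first error, including io.EOF.
func (c *cleanupStream) Recv() (string, error) {
	msg, err := c.inner.Recv()
	if err != nil {
		c.once.Do(c.cleanup)
	}
	return msg, err
}

func main() {
	s := withCleanup(&fakeStream{msgs: []string{"chunk-1", "chunk-2"}}, func() {
		fmt.Println("connection released")
	})
	_ = s.CloseSend() // request fully sent; response still pending
	for {
		msg, err := s.Recv()
		if err != nil {
			break
		}
		fmt.Println("received", msg)
	}
}
```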
Core integration:
* IsCloudProxyBackendPassthrough hook in chat + Anthropic
  endpoints; legacy "proxy-*" backend prefix removed (hard
  cutover, nothing released).
* cloudproxy.ForwardViaBackend + cloudproxy.BuildStreamFilter
  shared by both endpoint families.
* PII filter applies to translate mode via the standard
  streaming pipeline; verified by e2e.
Routing + middleware (carried from earlier on the branch):
* Score / Rerank / Embedder / VectorStore interfaces in
  core/backend with Application factory methods.
* Router with score classifier, depth-1 invariant, embedding
  cache, PII config, billing recorder.
* Admission middleware, route-model dispatch, usage stamping.
* MITM proxy + CA management for intercepting cloud traffic.
* Middleware admin page in the React UI.
Local-store backend rewrite + tests covering Set / Get /
Delete / Find invariants.
Llama-cpp Score concurrency guard: conflict_guard tripwire
plus FLAG_SCORE/{CHAT,COMPLETION,EMBEDDINGS} validation rule
in core/config.
Tests: 60+ new unit tests across cloud-proxy backend, cloudproxy
core glue, gRPC server + AIModelRich dispatch, config validation,
and 6 e2e specs that stand up a real two-process gRPC link with
fake upstreams (gaps mudler#1/mudler#2/mudler#3 from review).
Docs: cloud-proxy.md, middleware.md, mitm-proxy.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Richard Palethorpe <io@richiejp.com>
`go build ./...` (and other multi-package builds that include backend/go/cloud-proxy or backend/go/local-store) writes a binary named after the package directory into the working directory. Add both names to the existing root-binary ignore block so the working tree stays clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This PR adds middleware that analyzes requests and then routes, filters, and transforms them.
Chat requests can be classified and labelled as requiring particular capabilities,
then routed to a model that satisfies all of those capabilities. Requests that require fewer capabilities can naturally be handled by smaller, specialized models. In addition, the more uncertain the classifier is, the more capabilities it selects, so difficult requests are routed to larger general-purpose models.
Classification is very fast, but once requests have been classified their embeddings can be used to avoid classifying similar requests again. This works by labelling the embeddings of past requests and then running a cosine-similarity search against the embedding of each new request.
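A minimal sketch of the embedding-cache idea, assuming an in-memory slice and a fixed similarity threshold. The `labelled` type, `Lookup`, and the threshold value are hypothetical; the branch uses its own Embedder / VectorStore interfaces, which are not reproduced here.

```go
package main

import (
	"fmt"
	"math"
)

// labelled pairs the embedding of a past request with the labels
// the classifier assigned to it.
type labelled struct {
	embedding []float64
	labels    []string
}

// cosine returns the cosine similarity of a and b, or 0 when the vectors
// differ in length, are empty, or have zero norm.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// Lookup returns the labels of the most similar cached request, provided
// its similarity reaches threshold. A miss means the classifier must run.
func Lookup(cache []labelled, query []float64, threshold float64) ([]string, bool) {
	best, bestSim := -1, threshold
	for i, e := range cache {
		if s := cosine(e.embedding, query); s >= bestSim {
			best, bestSim = i, s
		}
	}
	if best < 0 {
		return nil, false
	}
	return cache[best].labels, true
}

func main() {
	cache := []labelled{
		{embedding: []float64{1, 0}, labels: []string{"code"}},
		{embedding: []float64{0, 1}, labels: []string{"math"}},
	}
	labels, hit := Lookup(cache, []float64{0.9, 0.1}, 0.9)
	fmt.Println(labels, hit)
}
```

A linear scan is enough to show the idea; a real deployment would presumably delegate the search to the vector store.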
Private information can be detected. When it is found in a request, the request can be modified to redact it,
routed differently, or blocked.
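The three outcomes can be sketched as follows. Everything here is illustrative: the `Action` values, the `Decision` struct, `ApplyPII`, the placeholder text, and the single email regex are assumptions, not the detector or configuration used on the branch.

```go
package main

import (
	"fmt"
	"regexp"
)

// Action is a hypothetical policy for requests that contain PII.
type Action int

const (
	Redact  Action = iota // rewrite the request with the PII masked
	Reroute               // send the request to a different model
	Block                 // refuse the request
)

// emailRe is a deliberately simple detector used only for illustration.
var emailRe = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)

// Decision describes what should happen to a request.
type Decision struct {
	Text    string
	Model   string
	Blocked bool
}

// ApplyPII leaves clean requests untouched and applies action otherwise.
// Unknown actions fall through to blocking, the most conservative choice.
func ApplyPII(text, model, privateModel string, action Action) Decision {
	if !emailRe.MatchString(text) {
		return Decision{Text: text, Model: model}
	}
	switch action {
	case Redact:
		return Decision{Text: emailRe.ReplaceAllString(text, "[REDACTED_EMAIL]"), Model: model}
	case Reroute:
		return Decision{Text: text, Model: privateModel}
	default:
		return Decision{Blocked: true}
	}
}

func main() {
	d := ApplyPII("mail bob@example.com now", "cloud-model", "local-model", Redact)
	fmt.Printf("%+v\n", d)
}
```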
Cloud models and a MITM proxy can be configured and take part in filtering and routing.
This allows sending easy requests to smaller local models and hard ones to cloud models.
The MITM proxy allows you to use Claude Code or Codex subscriptions (OAuth) with the PII
filter and potentially even with routing (although this is limited by the cloud providers' ToS).
Routing classifies requests using a model such as ArchRouter, which labels each request.
We score each request on the capabilities it may require and pick a model that
has all of the capabilities whose scores fall towards the top of the distribution.
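One way to read "towards the top of the distribution" is to keep every capability whose score is within a fixed ratio of the best score, then choose the smallest model covering all of them. This is a sketch under that assumption: the ratio rule, the `Model` struct, and both function names are hypothetical, and the router on the branch may use a different selection rule.

```go
package main

import (
	"fmt"
	"sort"
)

// Model is a hypothetical routing target.
type Model struct {
	Name         string
	SizeB        float64 // parameter count in billions, used as a cost proxy
	Capabilities map[string]bool
}

// RequiredCapabilities keeps every capability whose score is at least
// ratio times the top score. Scores are assumed to be non-negative.
// A flat (uncertain) distribution therefore selects more capabilities.
func RequiredCapabilities(scores map[string]float64, ratio float64) []string {
	top := 0.0
	for _, s := range scores {
		if s > top {
			top = s
		}
	}
	var required []string
	for c, s := range scores {
		if top > 0 && s >= ratio*top {
			required = append(required, c)
		}
	}
	sort.Strings(required) // deterministic order for logging and tests
	return required
}

// PickModel returns the smallest model that has every required capability.
func PickModel(models []Model, required []string) (Model, bool) {
	var best Model
	found := false
	for _, m := range models {
		ok := true
		for _, c := range required {
			if !m.Capabilities[c] {
				ok = false
				break
			}
		}
		if ok && (!found || m.SizeB < best.SizeB) {
			best, found = m, true
		}
	}
	return best, found
}

func main() {
	models := []Model{
		{Name: "small-coder", SizeB: 3, Capabilities: map[string]bool{"code": true}},
		{Name: "large-general", SizeB: 70, Capabilities: map[string]bool{"code": true, "math": true, "reasoning": true}},
	}
	confident := map[string]float64{"code": 0.9, "math": 0.05, "reasoning": 0.05}
	m, _ := PickModel(models, RequiredCapabilities(confident, 0.5))
	fmt.Println(m.Name)
}
```

With a confident classification only `code` is required, so the small model is chosen; a flatter score distribution pulls in more capabilities and pushes the request to the larger model.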
The ability to score multiple choices is an interesting feature in its own right.
It allows you to very quickly check with what probability an LLM would produce a particular
answer.
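As a rough illustration of what scoring multiple choices yields, the sketch below assumes the backend can return per-token log-probabilities for each candidate answer. It sums them per choice and normalizes across choices with a softmax. `ChoiceProbabilities` and its input shape are hypothetical; the Score interface on the branch is not reproduced here.

```go
package main

import (
	"fmt"
	"math"
)

// ChoiceProbabilities converts per-token log-probabilities for each
// candidate answer into a probability distribution over the candidates.
func ChoiceProbabilities(tokenLogProbs map[string][]float64) map[string]float64 {
	totals := make(map[string]float64, len(tokenLogProbs))
	maxTotal := math.Inf(-1)
	for choice, lps := range tokenLogProbs {
		sum := 0.0
		for _, lp := range lps {
			sum += lp // log P(answer) is the sum of its token log-probs
		}
		totals[choice] = sum
		if sum > maxTotal {
			maxTotal = sum
		}
	}
	// Subtract the maximum before exponentiating for numerical stability.
	norm := 0.0
	for _, t := range totals {
		norm += math.Exp(t - maxTotal)
	}
	probs := make(map[string]float64, len(totals))
	for choice, t := range totals {
		probs[choice] = math.Exp(t-maxTotal) / norm
	}
	return probs
}

func main() {
	probs := ChoiceProbabilities(map[string][]float64{
		"yes": {-0.1, -0.2},
		"no":  {-1.0, -2.0},
	})
	fmt.Printf("yes=%.3f no=%.3f\n", probs["yes"], probs["no"])
}
```

Scoring fixed candidates needs no sampling, which is consistent with the claim that the check is very quick.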